data commons
Community search signatures as foundation features for human-centered geospatial modeling
Sun, Mimi, Kamath, Chaitanya, Agarwal, Mohit, Muslim, Arbaaz, Yee, Hector, Schottlander, David, Bavadekar, Shailesh, Efron, Niv, Shetty, Shravya, Prasad, Gautam
Aggregated relative search frequencies offer a unique composite signal reflecting people's habits, concerns, interests, intents, and general information needs, which are not found in other readily available datasets. Temporal search trends have been successfully used in time series modeling across a variety of domains such as infectious diseases, unemployment rates, and retail sales. However, most existing applications require curating specialized datasets of individual keywords, queries, or query clusters, and the search data need to be temporally aligned with the outcome variable of interest. We propose a novel approach for generating an aggregated and anonymized representation of search interest as foundation features at the community level for geospatial modeling. We benchmark these features using spatial datasets across multiple domains. In zip codes with a population greater than 3,000 that cover over 95% of the contiguous US population, our models for predicting missing values in a 20% holdout set of counties achieve an average $R^2$ score of 0.74 across 21 health variables, and 0.80 across 6 demographic and environmental variables. Our results demonstrate that these search features can be used for spatial predictions without strict temporal alignment, and that the resulting models outperform spatial interpolation and state-of-the-art methods using satellite imagery features.
- Europe > Austria > Vienna (0.14)
- North America > United States > Texas > Harris County (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom (0.04)
- Research Report > Promising Solution (0.54)
- Research Report > New Finding (0.54)
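The county-holdout evaluation described in the Sun et al. abstract above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the embedding features, county assignments, outcome variable, and the ridge regressor are assumptions for illustration, since the paper's actual features and models are not part of this digest.

```python
# Minimal sketch: predict a community-level outcome from search-signature
# features, scoring R^2 on a 20% county-level holdout. All data is synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_zips, dim = 5000, 128
X = rng.normal(size=(n_zips, dim))           # search-signature embedding per zip code
county = rng.integers(0, 300, size=n_zips)   # county id for each zip code
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n_zips)  # stand-in health variable

# Hold out 20% of counties (not zip codes) so train and test are spatially disjoint.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=county))

model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
print("holdout R^2:", r2_score(y[test_idx], model.predict(X[test_idx])))
```

Grouping the split by county rather than by zip code is the key detail: it prevents a model from scoring well merely by memorizing neighboring zip codes in the same county.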
Google's new tool lets large language models fact-check their responses
The first of the two methods is called Retrieval-Interleaved Generation (RIG), which acts as a sort of fact-checker. If a user prompts the model with a question--like "Has the use of renewable energy sources increased in the world?"--the model will come up with a "first draft" answer. Then RIG identifies what portions of the draft answer could be checked against Google's Data Commons, a massive repository of data and statistics from reliable sources like the United Nations or the Centers for Disease Control and Prevention. Next, it runs those checks and replaces any incorrect original guesses with correct facts. It also cites its sources to the user.
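The RIG loop the article describes (draft, identify checkable values, query Data Commons, substitute) can be sketched as below. The bracketed claim markup and the substitution logic are hypothetical illustrations, not Google's implementation; `get_stat_value` is the public Data Commons Python API call for fetching a statistic.

```python
# Illustrative RIG-style fact-check loop. The [DC(place, stat_var) -> guess]
# markup is an assumed convention for marking checkable spans in a draft.
import re
import datacommons as dc  # pip install datacommons

def rig_check(draft: str) -> str:
    pattern = r"\[DC\((\S+),\s*(\S+)\)\s*->\s*([^\]]+)\]"
    def substitute(match: re.Match) -> str:
        place, stat_var, guess = match.groups()
        try:
            value = dc.get_stat_value(place, stat_var)  # latest observation
            return str(value)        # replace the model's guess with the fact
        except Exception:
            return guess.strip()     # keep the draft value if the lookup fails
    return re.sub(pattern, substitute, draft)

draft = "California has [DC(geoId/06, Count_Person) -> about 39 million] residents."
print(rig_check(draft))
```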
Knowing When to Ask -- Bridging Large Language Models and Data
Radhakrishnan, Prashanth, Chen, Jennifer, Xu, Bo, Ramaswami, Prem, Pho, Hannah, Olmos, Adriana, Manyika, James, Guha, R. V.
Large Language Models (LLMs) are prone to generating factually incorrect information when responding to queries that involve numerical and statistical data or other timely facts. In this paper, we present an approach for enhancing the accuracy of LLMs by integrating them with Data Commons, a vast, open-source repository of public statistics from trusted organizations like the United Nations (UN), the Centers for Disease Control and Prevention (CDC), and global census bureaus. We explore two primary methods: Retrieval Interleaved Generation (RIG), where the LLM is trained to produce natural language queries to retrieve data from Data Commons, and Retrieval Augmented Generation (RAG), where relevant data tables are fetched from Data Commons and used to augment the LLM's prompt. We evaluate these methods on a diverse set of queries, demonstrating their effectiveness in improving the factual accuracy of LLM outputs. Our work represents an early step towards building more trustworthy and reliable LLMs that are grounded in verifiable statistical data and capable of complex factual reasoning.
- North America > United States > California > San Francisco County > San Francisco (0.30)
- North America > United States > California > Santa Clara County > Mountain View (0.14)
- North America > United States > California > Sonoma County (0.05)
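The RAG path in the abstract above can be sketched with the public `datacommons-pandas` client: fetch a small statistics table and prepend it to the LLM's prompt. The prompt template, the statistical variable id, and the commented-out LLM call are assumptions for illustration, not the paper's code.

```python
# Sketch of Data Commons RAG: retrieve a time-series table, then augment
# the prompt with it before calling an LLM.
import datacommons_pandas as dcp  # pip install datacommons-pandas

def build_prompt(question: str, places: list[str], stat_var: str) -> str:
    # Rows are places, columns are observation dates.
    table = dcp.build_time_series_dataframe(places, stat_var)
    return (
        "Use only the table below to answer.\n\n"
        f"{table.to_csv()}\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How has renewable energy consumption changed?",
    ["country/USA"],                               # real Data Commons place DCID
    "Amount_Consumption_Energy_RenewableSources",  # hypothetical stat var id
)
# response = call_llm(prompt)  # any LLM client goes here
print(prompt[:400])
```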
Building Flexible, Scalable, and Machine Learning-ready Multimodal Oncology Datasets
Tripathi, Aakash, Waqas, Asim, Venkatesan, Kavya, Yilmaz, Yasin, Rasool, Ghulam
The advancements in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need to integrate data from multiple sources is further pronounced in complex diseases such as cancer, where it enables precision medicine and personalized treatments. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS offers an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to empower researchers with greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. Its cloud-native architecture can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee the scalability and security of its pipelines. MINDS overcomes the limitations of existing biomedical data silos via an interoperable metadata-driven approach that represents a pivotal step toward the future of oncology data integration.
- North America > United States > Florida (0.04)
- North America > United States > Oregon (0.04)
- North America > United States > California > Santa Cruz County > Santa Cruz (0.04)
- Europe > Malta > Northern Region > Western District > Attard (0.04)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
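The patient-centric metadata idea behind MINDS can be illustrated with a toy schema: one row per data object, keyed by patient, so cohort building reduces to a query over metadata while raw files stay in place. The table layout, field names, and URIs below are illustrative assumptions, not the MINDS schema.

```python
# Toy metadata store: cohort discovery becomes a SQL query over object
# metadata, with URIs pointing at the raw multimodal files.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE objects (
        patient_id TEXT,
        modality   TEXT,   -- e.g. 'radiology', 'histopathology', 'clinical'
        source     TEXT,   -- e.g. 'CRDC'
        uri        TEXT    -- pointer to the raw file; data stays in place
    )
""")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [
        ("p1", "radiology", "CRDC", "s3://bucket/p1/ct.dcm"),
        ("p1", "clinical", "CRDC", "s3://bucket/p1/notes.json"),
        ("p2", "histopathology", "CRDC", "s3://bucket/p2/slide.svs"),
    ],
)
# Cohort: patients that have both imaging and clinical records.
cohort = conn.execute("""
    SELECT patient_id FROM objects
    GROUP BY patient_id
    HAVING SUM(modality = 'radiology') > 0 AND SUM(modality = 'clinical') > 0
""").fetchall()
print(cohort)  # [('p1',)]
```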
Text2Cohort: Facilitating Intuitive Access to Biomedical Data with Natural Language Cohort Discovery
Kulkarni, Pranav, Kanhere, Adway, Yi, Paul H., Parekh, Vishwa S.
The Imaging Data Commons (IDC) is a cloud-based database that provides researchers with open access to cancer imaging data, with the goal of facilitating collaboration. However, cohort discovery within the IDC database has a significant technical learning curve. Recently, large language models (LLMs) have demonstrated exceptional utility for natural language processing tasks. We developed Text2Cohort, an LLM-powered toolkit to facilitate user-friendly natural language cohort discovery in the IDC. Our method translates user input into IDC queries using grounding techniques and returns the query's response. We evaluate Text2Cohort on 50 natural language inputs, ranging from information extraction to cohort discovery. Our toolkit successfully generated responses with 88% accuracy and a 0.94 F1 score. We demonstrate that Text2Cohort can enable researchers to discover and curate cohorts on the IDC with high accuracy using natural language in a more intuitive and user-friendly way.
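The translate-then-query pattern this abstract describes can be sketched as follows. The prompt wording and the `generate_sql` stub are assumptions, not the Text2Cohort implementation; IDC metadata is published in BigQuery, and the `dicom_all` table is used here only as an example target.

```python
# Illustrative sketch: ground a natural-language request in the IDC schema
# via an LLM prompt, then run the generated SQL against BigQuery.
SCHEMA_HINT = (
    "Table `bigquery-public-data.idc_current.dicom_all` with columns: "
    "PatientID, Modality, collection_id, BodyPartExamined"
)

def generate_sql(prompt: str) -> str:
    # Stub standing in for an LLM completion call; returns a canned query
    # so the sketch runs without credentials. Swap in any real LLM client.
    return (
        "SELECT DISTINCT PatientID "
        "FROM `bigquery-public-data.idc_current.dicom_all` "
        "WHERE Modality = 'MR' AND collection_id LIKE '%lung%'"
    )

def nl_to_query(request: str) -> str:
    prompt = (
        f"{SCHEMA_HINT}\n"
        f"Write one BigQuery SQL query for: {request}\n"
        "Return only SQL."
    )
    return generate_sql(prompt)

print(nl_to_query("all MRI patients from lung cancer collections"))
```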
Data Commons
Guha, Ramanathan V., Radhakrishnan, Prashanth, Xu, Bo, Sun, Wei, Au, Carolyn, Tirumali, Ajai, Amjad, Muhammad J., Piekos, Samantha, Diaz, Natalie, Chen, Jennifer, Wu, Julia, Ramaswami, Prem, Manyika, James
Publicly available data from open sources (e.g., the United States Census Bureau (Census), the World Health Organization (WHO), the Intergovernmental Panel on Climate Change (IPCC)) are vital resources for policy makers, students, and researchers across different disciplines. Combining data from different sources requires the user to reconcile differences in schemas, formats, assumptions, and more. This data wrangling is time consuming, tedious, and needs to be repeated by every user of the data. Our goal with Data Commons (DC) is to help make public data accessible and useful to those who want to understand it and use it to address societal challenges and opportunities. We do the data processing and make the processed data widely available via standard schemas and Cloud APIs. Data Commons is a distributed network of sites that publish data in a common schema and interoperate using the Data Commons APIs. Data from different Data Commons can be joined easily. The aggregate of these Data Commons can be viewed as a single Knowledge Graph, which can then be queried with natural language questions by leveraging advances in Large Language Models. This paper describes the architecture of Data Commons and some of its major deployments, and highlights directions for future work.
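The common-schema, single-Knowledge-Graph view described here is what the public Data Commons Python API exposes. A minimal sketch, assuming the `datacommons` package is installed (an API key may be required for some endpoints) and that the chosen statistic is available for these places:

```python
# Traverse the Data Commons Knowledge Graph, then read statistics from it.
import datacommons as dc  # pip install datacommons

# Graph traversal: counties contained in California (DCID geoId/06).
counties = dc.get_places_in(["geoId/06"], "County")["geoId/06"]

# Statistics over the same graph: latest population for a few counties.
for county in counties[:3]:
    name = dc.get_property_values([county], "name")[county][0]
    population = dc.get_stat_value(county, "Count_Person")
    print(name, population)
```

The point of the example is that place hierarchy and statistics live in one graph with one id space, so no schema reconciliation is needed between the traversal and the lookup.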
AI and the tyranny of the data commons
I am here to tell you the sad but true story of the demise of the sharing economy. Remember how we were told, back in the 1990s and 2000s, that we were contributing to the creation of the largest commons known to humanity? Well, to paraphrase The Lord of the Rings, we were all of us deceived, for another ring was made. Artificial intelligence (AI) is making that clearer than ever. The free data we generated by spending thousands of hours on Big Tech's platforms has been appropriated and converted into training data for AI models.
- North America > United States (0.05)
- Europe > Switzerland (0.05)
- Europe > Spain (0.05)
The Price of Your AI-Generated Selfie
The recent flooding of social media feeds with AI-generated "portraits" derived from databases of artists' work has renewed conversation over data ownership and the potential power AI has to supplant livelihoods in the future. The 22 million individuals and counting who have already handed over their images to the Lensa application might be fine to receive the myriad AI-illustrated images in exchange for their data. But the fundamental rights, principles, and freedoms users are giving up during this exchange remain largely unchecked. In Web3 technology circles, many promises have been made about decentralized technologies opening up the possibility of individual ownership and monetization of data, returning power to "creators." This reflects the political ethos held by Blockchain proponents like Ethereum co-founder Joe Lubin, who ostensibly seek to supplant the existing power structures of finance through "permissionless" consensus-based transaction data structures.
- Asia > India (0.15)
- North America > United States > Virginia (0.05)
- North America > United States > Utah (0.05)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (0.97)
Building a better data economy
It's "time to wake up and do a better job," says publisher Tim O'Reilly--from getting serious about climate change to building a better data economy. And the way a better data economy is built is through data commons--or data as a common resource--not as the giant tech companies are acting now, which is not just keeping data to themselves but profiting from our data and causing us harm in the process. "When companies are using the data they collect for our benefit, it's a great deal," says O'Reilly, founder and CEO of O'Reilly Media. "When companies are using it to manipulate us, or to direct us in a way that hurts us, or that enhances their market power at the expense of competitors who might provide us better value, then they're harming us with our data." And that's the next big thing he's researching: a specific type of harm that happens when tech companies use data against us to shape what we see, hear, and believe. It's what O'Reilly calls "algorithmic rents," which uses data, algorithms, and user interface design as a way of controlling who gets what information and why. Unfortunately, one only has to look at the news to see the rapid spread of misinformation on the internet tied to unrest in countries across the world. We can ask who profits, but perhaps the better question is "who suffers?" According to O'Reilly, "If you build an economy where you're taking more out of the system than you're putting back or that you're creating, then guess what, you're not long for this world." That really matters because users of this technology need to stop thinking about the worth of individual data and what it means when very few companies control that data, even when it's more valuable in the open. After all, there are "consequences of not creating enough value for others." We're now approaching a different idea: what if it's actually time to start rethinking capitalism as a whole? "It's a really great time for us to be talking about how do we want to change capitalism, because we change it every 30, 40 years," O'Reilly says. He clarifies that this is not about abolishing capitalism, but what we have isn't good enough anymore. "We actually have to do better, and we can do better. And to me better is defined by increasing prosperity for everyone."
- North America > United States > California (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany (0.04)
- Law (1.00)
- Banking & Finance (1.00)
- Information Technology > Services (0.68)
- Government > Regional Government > North America Government > United States Government (0.46)
Google launches a suite of tech-powered tools for reporters, Journalist Studio – TechCrunch
Google is putting AI and machine learning technologies into the hands of journalists. The company this morning announced a suite of new tools, Journalist Studio, that will allow reporters to do their work more easily. At launch, the suite includes a host of existing tools as well as two new products aimed at helping reporters search across large documents and visualize data. The first tool is called Pinpoint and is designed to help reporters work with large file sets -- like those that contain hundreds of thousands of documents. Pinpoint will work as an alternative to using the "Ctrl+F" function to manually seek out specific keywords in the documents.
- North America > United States (0.17)
- North America > Mexico (0.05)
- Asia > Philippines (0.05)